INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellation include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Define The Problem

1) Predicting which booking is likely to be canceled.

2) Identifying factors that have a high influence on booking cancellations.

3) Formulating recommendations for reducing cancellations, or for reducing the sunk cost resulting from cancellations.

Importing necessary libraries and data

Data Overview

Take a first look at the data

Observations

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Leading Questions Answered:

  1. What are the busiest months in the hotel?

    Month 10 (October), with 14.7% of the total bookings for the year.

  2. Which market segment do most of the guests come from?

    Online: 23214 bookings, or 64%, come via the internet.

  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

    Online bookings have the highest prices despite also having the largest number of free rooms (I suppose these are redeemed through online retailers' points systems). Aviation, Offline, and Corporate are generally slightly lower priced, with Corporate the lowest. Complimentary rooms are, of course, free.

  4. What percentage of bookings are canceled?

    About one third (11885) of the bookings in the sample data are canceled.

  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

    Repeating guests rarely cancel (1.75%).

  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

    The absence of special requests increases the likelihood of cancellation. A single special request begins to reduce that likelihood, and each additional request reduces it further, reaching zero cancellations at three requests.
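
The percentages above are group-wise cancellation rates. A minimal sketch of that calculation follows, using only the standard library so it runs on its own. The rows are made up for illustration and are not taken from the dataset, and `cancellation_rate_by_group` is a hypothetical helper name rather than a function from the notebook.

```python
from collections import defaultdict

# Made-up rows for illustration only: (number of special requests, canceled?)
bookings = [
    (0, True), (0, True), (0, False), (0, False),
    (1, True), (1, False), (1, False), (1, False),
    (2, False), (2, False),
]


def cancellation_rate_by_group(rows):
    """Return {group: share of canceled bookings} for (group, canceled) pairs."""
    totals = defaultdict(int)
    canceled = defaultdict(int)
    for group, is_canceled in rows:
        totals[group] += 1
        canceled[group] += int(is_canceled)
    return {group: canceled[group] / totals[group] for group in sorted(totals)}


rates = cancellation_rate_by_group(bookings)
for group, rate in rates.items():
    print(f"{group} special request(s): {rate:.0%} canceled")
```

The same helper works for any grouping key, such as market segment or a repeated-guest flag, as long as each row pairs the key with the cancellation outcome.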

Univariate

Data Preprocessing

Two columns have heavy outliers: lead_time and avg_room_price. I will only log-transform avg_room_price, because I am going to bin lead_time, which should handle its outliers.
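
The two treatments described above could look roughly like the sketch below. It is dependency-free, so it stands alone. The bin edges and labels are assumptions chosen for illustration, since the notebook's actual bins are not shown here, and the use of log(1 + x) rather than a plain log is also an assumption, made so that free rooms priced at 0 do not break the transform.

```python
import math
from bisect import bisect_left

# Hypothetical bin edges in days (upper bounds, inclusive); not the notebook's actual bins.
LEAD_TIME_EDGES = [30, 90, 151]
LEAD_TIME_LABELS = ["0-30", "31-90", "91-151", "152+"]


def log_price(price):
    """Compress the long right tail of prices; log1p keeps a price of 0 defined."""
    return math.log1p(price)


def lead_time_bin(days):
    """Map a lead time in days to a bin label, so extreme values share the last bin."""
    return LEAD_TIME_LABELS[bisect_left(LEAD_TIME_EDGES, days)]


for days, price in [(5, 0.0), (120, 95.5), (443, 540.0)]:
    print(lead_time_bin(days), round(log_price(price), 3))
```

With pandas and NumPy, the equivalent operations would typically be `numpy.log1p` on the price column and `pandas.cut` on the lead time column.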

EDA

EDA Insights

Checking Multicollinearity

Building a Logistic Regression model

Model performance evaluation

Final Model Summary

Building a Decision Tree model

Prune the Model

Let's use GridSearch to hyperparameter tune the model
Cost Complexity Pruning

- Still looking for recall, not accuracy, so we look at the Decision Tree classifier.
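
To illustrate why recall and accuracy can point to different models, here is a small sketch with made-up confusion-matrix counts for two hypothetical models. Recall is the share of actual cancellations the model catches, which presumably matters most here because a missed cancellation is the costly error. None of these numbers come from the notebook.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all bookings classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Share of actual cancellations that the model flagged."""
    return tp / (tp + fn)


# Made-up counts on 100 bookings, 33 of which were actually canceled.
model_a = {"tp": 15, "fp": 2, "fn": 18, "tn": 65}   # cautious: rarely predicts "canceled"
model_b = {"tp": 28, "fp": 16, "fn": 5, "tn": 51}   # flags more bookings as "canceled"

for name, m in [("A", model_a), ("B", model_b)]:
    print(name, f"accuracy={accuracy(**m):.2f}", f"recall={recall(m['tp'], m['fn']):.2f}")
```

In this toy case model A has slightly higher accuracy, but model B catches far more of the cancellations, so B would be preferred when recall is the target.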

Actionable Insights and Recommendations

The three most important variables for predicting cancellations were lead time (how far in advance the room was booked), the number of special requests for the stay, and the average price of the room. Bookings made 151 days (about 5 months) or less in advance were much less likely to be canceled, and guests who also made a special request were very unlikely to cancel. I believe this is an opportunity. Bookings made more than 151 days in advance were more likely to be canceled, and price was the determining factor for those cancellations: the likelihood of cancellation increased if the room was priced over 100.04 euros. This leads me to believe that these guests booked early and subsequently found a better deal.
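
The splits described above can be written as a simple rule of thumb. The sketch below is a simplification for illustration, not the fitted tree itself: the two thresholds come from the text, while the function name and the four risk labels are assumptions introduced here.

```python
LEAD_TIME_SPLIT_DAYS = 151   # threshold quoted in the text
PRICE_SPLIT_EUR = 100.04     # threshold quoted in the text


def cancellation_risk(lead_time_days, special_requests, avg_room_price):
    """Return a coarse, hypothetical risk label based on the three key variables."""
    if lead_time_days <= LEAD_TIME_SPLIT_DAYS:
        # Shorter lead times cancel less; a special request lowers the risk further.
        return "very low" if special_requests > 0 else "low"
    # Long lead times cancel more, especially for higher-priced rooms.
    return "high" if avg_room_price > PRICE_SPLIT_EUR else "elevated"


print(cancellation_risk(30, 1, 120.0))    # short lead time with a special request
print(cancellation_risk(200, 0, 130.0))   # long lead time, priced above the split
```

A rule like this could be used to flag bookings for follow-up, but the actual model should be used for any real prediction.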

My Recommendations